from __future__ import print_function, division
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pickle as pkl
import sklearn
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_curve, precision_recall_curve
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import svm
from sklearn.ensemble import VotingClassifier
from yellowbrick.classifier import DiscriminationThreshold
import xgboost as xgb
import lime.lime_tabular
import shap
import warnings
warnings.filterwarnings('ignore')
The following helper functions are defined here and are required during the construction of the models.
def print_cv(cv):
    print('\n'.join([str((x, y, z)) for (x, y, z) in zip(cv.cv_results_['params'], cv.cv_results_['mean_train_score'], cv.cv_results_['mean_test_score'])]))
def pretty_confusion_matrix(y, preds):
    return pd.DataFrame(confusion_matrix(y, preds),
                        columns=pd.MultiIndex(levels=[['prediction'], ['0', '1']], codes=[[0, 0], [0, 1]]),
                        index=pd.MultiIndex(levels=[['target'], ['0', '1']], codes=[[0, 0], [0, 1]]))
# note: older pandas versions expect labels= instead of codes=
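An equivalent labelled table can also be built with pd.crosstab, which handles the row/column naming automatically; a small sketch on hypothetical toy labels:

```python
import pandas as pd

# toy labels and predictions (hypothetical values)
y = [0, 0, 1, 1]
preds = [0, 1, 1, 1]

# crosstab builds the same target-by-prediction count table,
# filling unobserved combinations with zeros
cm = pd.crosstab(pd.Series(y, name='target'), pd.Series(preds, name='prediction'))
print(cm)
```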
We import the starting dataset, defining SeniorCitizen as a string and treating a blank space as a missing value.
df = pd.read_csv("TelcoChurn.csv", dtype={"SeniorCitizen":str},na_values=[' '])
df.head()
df.isna().sum()
By analyzing the missing values, we discover that only TotalCharges contains missing values, coded as NaN. This happens when tenure equals zero and, since TotalCharges is approximately the product of tenure and MonthlyCharges, we replace these missing TotalCharges values with 0.
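The claimed relationship can be checked directly; a minimal sketch on a toy frame mimicking the pattern (with the real df loaded, the same check is `(df.TotalCharges.isna() == (df.tenure == 0)).all()`):

```python
import numpy as np
import pandas as pd

# toy frame mimicking the pattern (hypothetical values)
toy = pd.DataFrame({'tenure': [0, 5, 0, 12],
                    'TotalCharges': [np.nan, 350.0, np.nan, 800.0]})

# TotalCharges should be missing exactly where tenure is zero
match = (toy['TotalCharges'].isna() == (toy['tenure'] == 0)).all()
print(match)
```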
df.replace(np.nan,0,inplace=True)
df.isna().sum()
len(df.customerID.unique()) # check that each customerID is unique
df.describe()
df.dtypes
We decide to merge PhoneService and MultipleLines into PhoneLines because, combined, they provide the same information.
df.MultipleLines.replace(["No phone service","No","Yes"],["No", "Single", "Multiple"], inplace=True)
df.rename(columns={'MultipleLines':'PhoneLines'}, inplace=True)
df.drop("PhoneService", axis=1, inplace=True)
df.head()
We add a new column named TotalServices as the sum of the online services, to analyze their cumulative effect.
For this reason we temporarily recode the service-related variables.
service_cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection",
                "TechSupport", "StreamingTV", "StreamingMovies"]
for col in service_cols:
    df[col] = df[col].apply(lambda x: 1 if x == 'Yes' else 0)
df['TotalServices'] = df[service_cols].sum(axis=1)
for col in service_cols:
    df[col] = df[col].apply(lambda x: 'Yes' if x == 1 else 'No')
df.head()
df.to_csv("TelcoChurnFinal.csv", index=False) #we saved the final dataframe of this part
df = pd.read_csv('TelcoChurnFinal.csv', dtype={'SeniorCitizen':str})
df.dtypes
We print all the levels of each variable:
for item in df.columns:
    print(item)
    print(df[item].unique())
df.shape
df.columns
Before analyzing the models in detail, we present univariate and bivariate graphical analyses of some of the variables in order to extract preliminary insights.
graph = df["Churn"].value_counts().plot(kind='bar', figsize=(10,7))
graph.set_ylabel("Number of Customers", fontsize=10)
plt.title("Churn distribution")
totals = []
for i in graph.patches:
    totals.append(i.get_height())
total = sum(totals)
for i in graph.patches:
    graph.text(i.get_x() + 0.2, i.get_height(),
               str(i.get_height()), fontsize=15,
               color='#444444')
plt.show()
As can be seen from the representation of the response variable, about one quarter of the customers in the dataset actually leave the company.
We are now going to display histograms to analyze the absolute frequency of the numerical variables.
df["tenure"].plot.hist(color='darkseagreen', alpha=0.7, bins=30)
plt.title("Frequency of tenure")
plt.show()
There seem to be two peaks: one corresponding to low values of tenure and the other at tenure equal to 72. We suppose the second peak is due to the dataset being a snapshot of the company's customers over a span of 6 years (72 months); tenure equal to 72 therefore represents clients who either left the company in the last month of the evaluation period or remained with the company for the whole period.
df["MonthlyCharges"].plot.hist(color='darkseagreen', alpha=0.7, bins=30)
plt.title("Frequency of MonthlyCharges")
plt.show()
The majority of customers have a low monthly rate.
df["TotalCharges"].plot.hist(color='darkseagreen', alpha=0.7, bins=30)
plt.title("Frequency of TotalCharges")
plt.show()
fig, axes = plt.subplots(1, 3, figsize=(17,5))
sns.boxplot(x=df["tenure"], orient="v", color="tab:blue",ax=axes[0])
sns.boxplot(x=df["MonthlyCharges"], orient="v", color="darkseagreen",ax=axes[1])
sns.boxplot(x=df["TotalCharges"], orient="v", color="coral",ax=axes[2])
plt.show()
In the following section, we present bivariate graphical representations to better understand how the dataset's variables are structured and to extract some insights.
df.groupby(["gender", "Churn"]).size().unstack().plot(kind='bar', stacked=True, figsize=(8,7))
plt.title("Gender count by Churn")
plt.show()
Looking at the gender variable, the dataset appears well balanced between Female and Male; moreover, the proportions of males and females that leave appear homogeneous.
fig, axes = plt.subplots(1, 2, figsize=(18,6), sharey=True)
yes = df.tenure[df["Churn"]=="Yes"]
noyes = df.tenure[df["Churn"]=="No"]
axes[0].hist(yes, bins=20, color='darkseagreen')
axes[0].set_title("Tenure histogram for Churn = Yes")
axes[0].set(xlabel='Tenure', ylabel='Count')
axes[1].hist(noyes, bins=20, color='tab:blue')
axes[1].set_title("Tenure histogram for Churn = No")
axes[1].set(xlabel='Tenure', ylabel='Count')
plt.show()
Considering customers that leave the company, we notice that the majority of them have a low value of tenure. This behaviour is probably linked to the fact that many people frequently change service provider to take advantage of offers with low initial tariffs.
On the other hand, there is no specific pattern among the clients that stay; we only highlight a peak at high values of tenure.
df.groupby(["TotalServices", "Churn"]).size().unstack().plot(kind='bar', stacked=True, figsize=(15,8))
plt.title("TotalServices count by Churn")
plt.show()
Considering the TotalServices variable that we created, we can see that the churn proportions of customers having 0, 1 or 2 services are similar, while customers with more than 3 services tend to churn less.
fig, axes = plt.subplots(1, 2, figsize=(15,6), sharey=True)
yes = df.MonthlyCharges[df["Churn"]=="Yes"]
noyes = df.MonthlyCharges[df["Churn"]=="No"]
axes[0].hist(yes, bins=30, color='darkseagreen')
axes[0].set_title("MonthlyCharges histogram for Churn = Yes")
axes[0].set(xlabel='MonthlyCharges', ylabel='Count')
axes[1].hist(noyes, bins=30, color='tab:blue')
axes[1].set_title("MonthlyCharges histogram for Churn = No")
axes[1].set(xlabel='MonthlyCharges', ylabel='Count')
plt.show()
The only visible difference is at low values of MonthlyCharges: there is a peak only for people that remain clients.
df.groupby(["InternetService", "Churn"]).size().unstack().plot(kind='bar', stacked=True, figsize=(12,7))
plt.title("InternetServices count by Churn")
plt.show()
The proportion of customers that leave is lower among those without any internet service: this may be because they are older people, or because, lacking such services, their monthly charge is lower.
df[df.Churn=='Yes'].groupby(["SeniorCitizen","InternetService"]).size().unstack().plot(kind='bar', stacked=True, figsize=(20,5),color=['darkseagreen','tab:blue',"coral"])
plt.title('Yes Churn')
plt.show()
As we supposed, senior citizens without internet service tend to remain with the company.
df.groupby(["PhoneLines", "Churn"]).size().unstack().plot(kind='bar', stacked=True, figsize=(12,7))
plt.title("PhoneLines count by Churn")
plt.show()
Looking at this plot, we can make the same observation as for InternetService in the previous graph.
df.groupby(["Partner", "Churn"]).size().unstack().plot(kind='bar', stacked=True, figsize=(12,7))
plt.title("Partner count by Churn")
plt.show()
df.groupby(["Dependents", "Churn"]).size().unstack().plot(kind='bar', stacked=True, figsize=(12,7))
plt.title("Dependents count by Churn")
plt.show()
These last two plots give us information about the customers' family status: as can be seen, the proportion of churners is higher among customers without a partner or dependents.
df = pd.read_csv('TelcoChurnFinal.csv', dtype={'SeniorCitizen':str})
df.head()
Removing customerID
df.drop('customerID', axis=1, inplace=True)
We use the function get_dummies with drop_first=True to obtain n-1 dummies for each variable. This step is required to apply sklearn models.
We chose this type of encoding to highlight the presence or absence of each particular service.
df = pd.get_dummies(df, drop_first=True)
df.columns
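To illustrate what drop_first does, a toy example on a hypothetical three-level column:

```python
import pandas as pd

toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year']})
# the first level becomes the implicit baseline, leaving n-1 indicator columns
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))
```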
# correlation matrix
def show_correlations(dataframe, show_chart=True):
    fig = plt.figure(figsize=(15, 10))
    corr = dataframe.corr()
    if show_chart:
        sns.heatmap(corr,
                    xticklabels=corr.columns.values,
                    yticklabels=corr.columns.values,
                    annot=True)
    return corr
correlation_df = show_correlations(df,show_chart=True)
df_copy = df.copy()
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.25, n_splits=1, random_state=1123)
for train_index, test_index in sss.split(df_copy, df_copy['Churn_Yes']):
    print(len(train_index), len(test_index))
    print(df_copy.loc[train_index, 'Churn_Yes'].mean(), df_copy.loc[test_index, 'Churn_Yes'].mean())
    train = df_copy.loc[train_index]
    test = df_copy.loc[test_index]
We use StratifiedShuffleSplit in order to maintain the same proportion (0.265) of the target variable in both the train and test datasets.
# this piece of code excludes variables that are too correlated with the others
couples = {}
correlations = train.corr()
drop_c = []
for i, r in correlations.iterrows():
    for j, x in r.items():  # Series.iteritems was removed in recent pandas
        if i != j and np.abs(x) > 0.9 and i not in drop_c and j not in drop_c:
            couples[(i, j)] = (correlations.loc[i, j], correlations.loc[i, 'Churn_Yes'], correlations.loc[j, 'Churn_Yes'])
            drop_c.append(j)
With 0.9 as the cutoff for removing highly correlated variables, no variable is dropped:
drop_c
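The nested loop above can also be expressed through the upper triangle of the correlation matrix, so each pair is inspected exactly once; a sketch on synthetic data (column names are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.01, size=200),  # nearly identical to a
                    'c': rng.normal(size=200)})

corr = toy.corr().abs()
# keep only the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```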
The code below separates the target variable from the predictors in both train and test:
X = train.drop('Churn_Yes', axis=1)
Y = train['Churn_Yes']
X_test = test.drop('Churn_Yes', axis=1)
y_test = test['Churn_Yes']
The code below drops some highly correlated variables for the logistic regression. This is done because logistic regression is a parametric model, whose coefficient estimates are distorted by multicollinearity.
X_lr = X.drop(['TotalCharges', 'OnlineSecurity_Yes', 'OnlineBackup_Yes',
               'DeviceProtection_Yes', 'TechSupport_Yes', 'StreamingTV_Yes',
               'StreamingMovies_Yes'], axis=1)
X_test_lr = X_test.drop(['TotalCharges', 'OnlineSecurity_Yes', 'OnlineBackup_Yes',
                         'DeviceProtection_Yes', 'TechSupport_Yes', 'StreamingTV_Yes',
                         'StreamingMovies_Yes'], axis=1)
In this section we fit several models. We use a cross-validation approach to choose the best parameters: for each model we first set a wide grid of parameters and, as we perform successive cross-validations, we narrow the parameter ranges and finally train the best model. The models resulting from cross-validation are compared by taking into account test AUC and its discrepancy with train AUC.
Then, for the evaluation of the models, we change the classification threshold by selecting the one that maximizes the F1 score.
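The F1-maximizing threshold can be computed directly from precision_recall_curve; a sketch on toy probabilities (all values are hypothetical):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.7, 0.2, 0.3])

precision, recall, th = precision_recall_curve(y_true, probs)
# the last precision/recall pair has no associated threshold, so drop it;
# the small epsilon guards against division by zero
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_th = th[np.argmax(f1)]
best_preds = (probs >= best_th).astype(int)
print(best_th, f1_score(y_true, best_preds))
```

By construction, the selected threshold does at least as well on F1 as the default 0.5 cutoff.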
lr_model = Pipeline([('scaler', RobustScaler()),
('model', LogisticRegression())])
#model training
lr_model.fit(X_lr, Y)
print('\n'.join([str((x, c)) for (x, c) in zip(X_lr.columns, lr_model.named_steps['model'].coef_[0])]))
#these are the values of the coefficient that we use for the logistic model
test_probs_tot = lr_model.predict_proba(X_test_lr)
test_preds = lr_model.predict(X_test_lr)
test_probs = [x_lr[1] for x_lr in test_probs_tot]
lr_probs = [y for (x, y) in lr_model.predict_proba(X_lr)]
lr_probs_test = [y for (x, y) in lr_model.predict_proba(X_test_lr)]
We now plot the ROC curve and print the AUC values.
lr_fpr_train, lr_tpr_train, _ = roc_curve(Y, lr_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs_test)
plt.figure(1, figsize=(10, 7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(lr_fpr_train, lr_tpr_train, label='LogReg - train')
plt.plot(lr_fpr, lr_tpr, label='LogReg - test')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Logistic Regression - ROC curve')
plt.legend(loc='best')
plt.show()
print('Train set auc: {}'.format(roc_auc_score(Y, lr_probs)))
print('Test set auc: {}'.format(roc_auc_score(y_test, lr_probs_test)))
with open('./models/lr_model.pkl', 'wb') as f:
    pkl.dump(lr_model, f)  # saving the model
with open('./models/lr_model.pkl', 'rb') as f:
    lr = pkl.load(f)
lr_preds = lr.predict(X_test_lr.drop(drop_c, axis=1))
lr_probs_train = lr.predict_proba(X_lr.drop(drop_c, axis=1))[:, 1]
lr_probs = lr.predict_proba(X_test_lr.drop(drop_c, axis=1))[:, 1]
lr_auc_train = roc_auc_score(Y, lr_probs_train)
lr_auc = roc_auc_score(y_test, lr_probs)
print(precision_score(y_test, lr_preds))
print(recall_score(y_test, lr_preds))
print(accuracy_score(y_test, lr_preds))
print(f1_score(y_test, lr_preds))
precision, recall, th = precision_recall_curve(y_test, lr_probs)
plt.figure(1, figsize=(10, 7))
plt.plot(th, precision[1:], label="Precision",linewidth=3)
plt.plot(th, recall[1:], label="Recall",linewidth=3)
plt.title('Precision and recall for different thresholds')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.xlim((0,1))
plt.axvline(x=0.415, color = 'black', ls = 'dashed')
plt.axvline(x=0.5, color = 'deeppink', ls = 'dashed')
plt.legend()
plt.show()
pretty_confusion_matrix(y_test, lr_preds)
the_cutoff = 0.3
new_lr_preds = [1 if x >= the_cutoff else 0 for x in lr_probs]
pretty_confusion_matrix(y_test, new_lr_preds)
print(precision_score(y_test, new_lr_preds))
print(recall_score(y_test, new_lr_preds))
print(accuracy_score(y_test, new_lr_preds))
print(f1_score(y_test, new_lr_preds))
from yellowbrick.classifier import DiscriminationThreshold
visualizer = DiscriminationThreshold(lr)
visualizer.fit(X_lr, Y) # Fit the data to the visualizer
visualizer.show()
the_cutoff = 0.34
new_lr_preds = [1 if x >= the_cutoff else 0 for x in lr_probs]
pretty_confusion_matrix(y_test, new_lr_preds)
print(precision_score(y_test, new_lr_preds))
print(recall_score(y_test, new_lr_preds))
print(accuracy_score(y_test, new_lr_preds))
print(f1_score(y_test, new_lr_preds))
# This is a useful function to print the results
def print_cv(cv):
    print('\n'.join([str((x, y, z)) for (x, y, z) in zip(cv.cv_results_['params'], cv.cv_results_['mean_train_score'], cv.cv_results_['mean_test_score'])]))
r_seed = 1123
dt = DecisionTreeClassifier(random_state=r_seed)
params = {'criterion': ['gini', 'entropy'],
'max_depth': np.arange(4, 15, 2),
'min_samples_leaf': [1, 0.01, 0.001]}
cv_tuning = RandomizedSearchCV(dt, params, random_state=r_seed, scoring='roc_auc',
return_train_score=True, cv=5, n_iter=20)
cv_tuning.fit(X, Y)
print_cv(cv_tuning)
params = {'criterion': ['gini', 'entropy'],
'max_depth': [4, 6, 8],
'min_samples_leaf': [0.1 ,0.01, 0.001]}
cv_tuning = RandomizedSearchCV(dt, params, random_state=r_seed, scoring='roc_auc',
return_train_score=True, cv=5, n_iter=20)
cv_tuning.fit(X, Y)
print_cv(cv_tuning)
dt = DecisionTreeClassifier(random_state=r_seed, criterion='entropy')
params = {'max_depth': [4, 5], 'min_samples_leaf': [0.1, 0.01]}
cv_tuning = GridSearchCV(dt, params, scoring='roc_auc', return_train_score=True, cv=5)
cv_tuning.fit(X, Y)
print_cv(cv_tuning)
dt = DecisionTreeClassifier(random_state=r_seed, criterion='entropy',
max_depth=5, min_samples_leaf=0.01)
dt.fit(X, Y)
dt_probs = [y for (x, y) in dt.predict_proba(X)]
dt_probs_test = [y for (x, y) in dt.predict_proba(X_test)]
print('Train set auc: {}'.format(roc_auc_score(Y, dt_probs))) # train
print('Test set auc: {}'.format(roc_auc_score(y_test, dt_probs_test))) #test
dt_fpr_train, dt_tpr_train, _ = roc_curve(Y, dt_probs)
dt_fpr, dt_tpr, _ = roc_curve(y_test, dt_probs_test)
plt.figure(1, figsize=(10, 7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(dt_fpr_train, dt_tpr_train, label='DTree - train')
plt.plot(dt_fpr, dt_tpr, label='DTree - test')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Decision Tree - ROC curve')
plt.legend(loc='best')
plt.show()
with open('./models/dt_model.pkl', 'wb') as f:
    pkl.dump(dt, f)
with open('./models/dt_model.pkl', 'rb') as f:
    dt = pkl.load(f)
dt_preds = dt.predict(X_test)
dt_probs_train = [y for (x, y) in dt.predict_proba(X)]
dt_probs = [y for (x, y) in dt.predict_proba(X_test)]
dt_auc_train = roc_auc_score(Y, dt_probs_train)
dt_auc = roc_auc_score(y_test, dt_probs)
print(precision_score(y_test, dt_preds))
print(recall_score(y_test, dt_preds))
print(accuracy_score(y_test, dt_preds))
print(f1_score(y_test, dt_preds))
precision, recall, th = precision_recall_curve(y_test, dt_probs)
plt.figure(1, figsize=(10, 7))
plt.plot(th, precision[1:], label="Precision",linewidth=3)
plt.plot(th, recall[1:], label="Recall",linewidth=3)
plt.title('Precision and recall for different thresholds')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.xlim((0,1))
plt.axvline(x=0.385, color = 'black', ls = 'dashed')
plt.axvline(x=0.5, color = 'deeppink', ls = 'dashed')
plt.legend()
plt.show()
pretty_confusion_matrix(y_test, dt_preds)
from yellowbrick.classifier import DiscriminationThreshold
visualizer = DiscriminationThreshold(dt)
visualizer.fit(X, Y) # Fit the data to the visualizer
visualizer.show()
the_cutoff = 0.42
new_dt_preds = [1 if x >= the_cutoff else 0 for x in dt_probs]
pretty_confusion_matrix(y_test, new_dt_preds)
print(precision_score(y_test, new_dt_preds))
print(recall_score(y_test, new_dt_preds))
print(accuracy_score(y_test, new_dt_preds))
print(f1_score(y_test, new_dt_preds))
r_seed = 1123
rf = RandomForestClassifier(random_state=r_seed, n_estimators=50)
np.logspace(-1, -3, 3)
params = {'criterion': ['gini', 'entropy'],
'max_depth': [2, 3, 4, 5],
'min_samples_leaf': np.logspace(-1, -3, 3)} #[0.1, 0.01, 0.001]
cv_tuning = RandomizedSearchCV(rf, params, random_state=r_seed, scoring='roc_auc', return_train_score=True, cv=5, n_iter=20)
cv_tuning.fit(X, Y)
print_cv(cv_tuning)
rf = RandomForestClassifier(random_state=r_seed, n_estimators=50, criterion='entropy')
params = {'max_depth': [2, 3],
'min_samples_leaf': [0.01, 0.001, 1],
'max_features': ['sqrt', 0.2, 0.5]}
cv_tuning = RandomizedSearchCV(rf, params, random_state=r_seed, scoring='roc_auc', return_train_score=True, cv=5)
cv_tuning.fit(X, Y)
print_cv(cv_tuning)
rf = RandomForestClassifier(random_state=r_seed, n_estimators=50, criterion='entropy', max_depth=3)
params = {'min_samples_leaf': [0.01, 0.001],
'max_features': [0.1, "sqrt", 0.2, 0.3]}
cv_tuning = GridSearchCV(rf, params, scoring='roc_auc', return_train_score=True, cv=5)
cv_tuning.fit(X, Y)
print_cv(cv_tuning)
rf = RandomForestClassifier(random_state=r_seed, n_estimators=50,
criterion='entropy', max_depth=3,
min_samples_leaf=0.01, max_features=0.2,
warm_start=True)
# now we tune the number of estimators
for i in range(5):
    rf.fit(X, Y)
    probs = [y for (x, y) in rf.predict_proba(X)]
    probs_test = [y for (x, y) in rf.predict_proba(X_test)]
    print(rf.n_estimators, roc_auc_score(Y, probs), roc_auc_score(y_test, probs_test))
    rf.n_estimators += 50
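The scan above is cheap because warm_start=True makes each fit call keep the already-grown trees and train only the newly requested ones; a minimal demonstration on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

rf_ws = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
rf_ws.fit(X_demo, y_demo)   # grows the first 10 trees
rf_ws.n_estimators += 10
rf_ws.fit(X_demo, y_demo)   # grows only the 10 additional trees
print(len(rf_ws.estimators_))
```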
r_seed = 1123
rf = RandomForestClassifier(random_state=r_seed, n_estimators=200,
criterion='entropy', max_depth=3,
min_samples_leaf=0.01, max_features=0.2,
warm_start=True)
rf.fit(X, Y)
rf_probs = [y for (x, y) in rf.predict_proba(X)]
rf_probs_test = [y for (x, y) in rf.predict_proba(X_test)]
print(roc_auc_score(Y, rf_probs))
print(roc_auc_score(y_test, rf_probs_test))
# list the feature importances; these scores are relative, so they sum to one
feats = [(x, y) for x, y in zip(X.columns, rf.feature_importances_) if y > 0]
sorted(feats, reverse=True, key=lambda x: x[1])
rf_fpr_train, rf_tpr_train, _ = roc_curve(Y, rf_probs)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs_test)
plt.figure(1, figsize=(10, 7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(rf_fpr_train, rf_tpr_train, label='RForest - train')
plt.plot(rf_fpr, rf_tpr, label='RForest - test')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Random Forest - ROC curve')
plt.legend(loc='best')
plt.show()
with open('./models/rf_model.pkl', 'wb') as f:
    pkl.dump(rf, f)
with open('./models/rf_model.pkl', 'rb') as f:
    rf = pkl.load(f)
rf_preds = rf.predict(X_test.drop(drop_c, axis=1))
rf_probs_train = rf.predict_proba(X.drop(drop_c, axis=1))[:, 1]
rf_probs = rf.predict_proba(X_test.drop(drop_c, axis=1))[:, 1]
rf_auc_train = roc_auc_score(Y, rf_probs_train)
rf_auc = roc_auc_score(y_test, rf_probs)
print(precision_score(y_test, rf_preds))
print(recall_score(y_test, rf_preds))
print(accuracy_score(y_test, rf_preds))
print(f1_score(y_test, rf_preds))
precision, recall, th = precision_recall_curve(y_test, rf_probs)
plt.figure(1, figsize=(10, 7))
plt.plot(th, precision[1:], label="Precision",linewidth=3)
plt.plot(th, recall[1:], label="Recall",linewidth=3)
plt.title('Precision and recall for different thresholds')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.xlim((0,1))
plt.axvline(x=0.36, color = 'black', ls = 'dashed')
plt.axvline(x=0.5, color = 'deeppink', ls = 'dashed')
plt.legend()
plt.show()
pretty_confusion_matrix(y_test, rf_preds)
the_cutoff = 0.36
new_rf_preds = [1 if x >= the_cutoff else 0 for x in rf_probs]
pretty_confusion_matrix(y_test, new_rf_preds)
print(precision_score(y_test, new_rf_preds))
print(recall_score(y_test, new_rf_preds))
print(accuracy_score(y_test, new_rf_preds))
print(f1_score(y_test, new_rf_preds))
from yellowbrick.classifier import DiscriminationThreshold
visualizer = DiscriminationThreshold(rf)
visualizer.fit(X, Y) # Fit the data to the visualizer
visualizer.show()
the_cutoff = 0.33
new_rf_preds = [1 if x >= the_cutoff else 0 for x in rf_probs]
pretty_confusion_matrix(y_test, new_rf_preds)
print(precision_score(y_test, new_rf_preds))
print(recall_score(y_test, new_rf_preds))
print(accuracy_score(y_test, new_rf_preds))
print(f1_score(y_test, new_rf_preds))
r_seed = 1234
gb = GradientBoostingClassifier(random_state=r_seed, n_estimators=50)
params = {'loss' :['deviance', 'exponential'],
'learning_rate' : [0.001,0.0001,0.01],
'subsample' : [1,0.9,0.8],
'max_features': ['sqrt','log2', 0.2, 0.5,0.8,1],
'min_samples_split' : np.logspace(-1, -3, 3),
}
cv_tuning = GridSearchCV(gb, params, scoring='roc_auc', return_train_score=True, cv=5, n_jobs=-1)
cv_tuning.fit(X, Y)
result_cv = pd.DataFrame([(p,tr,te) for (p,tr,te) in zip(cv_tuning.cv_results_['params'],cv_tuning.cv_results_['mean_train_score'],cv_tuning.cv_results_['mean_test_score']) if te > 0.82]).sort_values(2,ascending=[0])
result_cv
result_cv.reset_index(drop=True,inplace=True)
result_cv.loc[0][0]
# best parameters found by the grid search:
# {'learning_rate': 0.01, 'loss': 'exponential', 'max_features': 'log2', 'min_samples_split': 0.01, 'subsample': 1}
gb_best = GradientBoostingClassifier(learning_rate=0.01,
loss='exponential',
max_features = "log2",
min_samples_split = 0.01,
subsample = 1,
random_state=r_seed,
n_estimators=50)
gb_best.fit(X,Y)
gb_best.n_estimators = 50
scores = []
numest = []
for i in range(20):
    gb_best.fit(X, Y)
    probs = [y for (x, y) in gb_best.predict_proba(X)]
    probs_test = [y for (x, y) in gb_best.predict_proba(X_test)]
    print(gb_best.n_estimators, roc_auc_score(Y, probs), roc_auc_score(y_test, probs_test))
    print(max(probs_test))  # maximum of the probability distribution
    scores.append(roc_auc_score(y_test, probs_test))
    numest.append(gb_best.n_estimators)
    gb_best.n_estimators += 50
plt.figure(1, figsize=(10, 7))
plt.plot(numest,scores,color = "tab:blue", linestyle='--', marker='o')
plt.xlabel('Number of estimators')
plt.ylabel('AUC test score')
plt.title('Gboost Auc score vs number of estimators')
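As an alternative to scanning n_estimators by hand, scikit-learn's GradientBoostingClassifier supports built-in early stopping through validation_fraction and n_iter_no_change; a sketch on synthetic data (parameter values here are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = make_classification(n_samples=500, random_state=0)

gb_es = GradientBoostingClassifier(n_estimators=2000, learning_rate=0.05,
                                   validation_fraction=0.2, n_iter_no_change=10,
                                   random_state=0)
gb_es.fit(X_demo, y_demo)
# n_estimators_ reports where training actually stopped
print(gb_es.n_estimators_)
```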
r_seed = 1234
gb_best = GradientBoostingClassifier(learning_rate=0.01,
loss='exponential',
max_features = "log2",
min_samples_split = 0.01,
subsample = 1,
random_state=r_seed,
n_estimators=900)
gb_best.fit(X,Y)
gb_probs = [y for (x, y) in gb_best.predict_proba(X)]
gb_probs_test = [y for (x, y) in gb_best.predict_proba(X_test)]
print(roc_auc_score(Y, gb_probs))
print(roc_auc_score(y_test, gb_probs_test))
gb_fpr_train, gb_tpr_train, _ = roc_curve(Y, gb_probs)
gb_fpr, gb_tpr, _ = roc_curve(y_test, gb_probs_test)
plt.figure(1, figsize=(10, 7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(gb_fpr_train, gb_tpr_train, label='gboost - train', color = "tab:blue")
plt.plot(gb_fpr, gb_tpr, label='gboost - test', color = "darkseagreen")
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('gboost - ROC curve')
plt.legend(loc='best')
plt.show()
with open('gb_model.pkl', 'wb') as f:
    pkl.dump(gb_best, f)
with open('gb_model.pkl', 'rb') as f:
    gb = pkl.load(f)
gb_preds = gb.predict(X_test.drop(drop_c, axis=1))
gb_probs_train = gb.predict_proba(X.drop(drop_c, axis=1))[:, 1]
gb_probs = gb.predict_proba(X_test.drop(drop_c, axis=1))[:, 1]
gb_auc_train = roc_auc_score(Y, gb_probs_train)
gb_auc = roc_auc_score(y_test, gb_probs)
print(precision_score(y_test, gb_preds))
print(recall_score(y_test, gb_preds))
print(accuracy_score(y_test, gb_preds))
print(f1_score(y_test, gb_preds))
precision, recall, th = precision_recall_curve(y_test, gb_probs)
plt.figure(1, figsize=(10, 7))
plt.plot(th, precision[1:], label="Precision",linewidth=3)
plt.plot(th, recall[1:], label="Recall",linewidth=3)
plt.title('Precision and recall for different thresholds')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.xlim((0,1))
plt.axvline(x=0.4, color = 'black', ls = 'dashed')
plt.axvline(x=0.5, color = 'deeppink', ls = 'dashed')
plt.legend()
plt.show()
pretty_confusion_matrix(y_test, gb_preds)
the_cutoff = 0.4
new_gb_preds = [1 if x >= the_cutoff else 0 for x in gb_probs]
pretty_confusion_matrix(y_test, new_gb_preds)
print(precision_score(y_test, new_gb_preds))
print(recall_score(y_test, new_gb_preds))
print(accuracy_score(y_test, new_gb_preds))
print(f1_score(y_test, new_gb_preds))
visualizer = DiscriminationThreshold(gb)
visualizer.fit(X, Y) # Fit the data to the visualizer
visualizer.show()
the_cutoff = 0.35
new_gb_preds = [1 if x >= the_cutoff else 0 for x in gb_probs]
pretty_confusion_matrix(y_test, new_gb_preds)
print(precision_score(y_test, new_gb_preds))
print(recall_score(y_test, new_gb_preds))
print(accuracy_score(y_test, new_gb_preds))
print(f1_score(y_test, new_gb_preds))
xg_clas = xgb.XGBClassifier(n_estimators=50)
params = {'objective':['binary:logistic'],
'learning_rate': [0.001,0.0001,0.01],
'subsample': [1,0.9,0.8,0.7],
'max_depth': [3, 4, 5, 6, 7, 8, 9],
'colsample_bytree': np.arange(0.4,0.8,0.1),
'min_child_weight': [1, 2, 3]
}
cv_tuning = GridSearchCV(xg_clas, params, scoring='roc_auc', return_train_score=True, cv=5, n_jobs=5)
cv_tuning.fit(X, Y)
result_cv = pd.DataFrame([(p,tr,te) for (p,tr,te) in zip(cv_tuning.cv_results_['params'],cv_tuning.cv_results_['mean_train_score'],cv_tuning.cv_results_['mean_test_score']) if te > 0.82]).sort_values(2,ascending=[0])
result_cv
result_cv.reset_index(drop=True,inplace=True)
result_cv.loc[0][0]
with open('xgb_cv_tuning.pkl', 'wb') as f:
    pkl.dump(cv_tuning, f)
xgb_best = xgb.XGBClassifier(objective = "binary:logistic",
learning_rate=0.01 ,
subsample=0.8 ,
max_depth=6 ,
colsample_bytree=0.5 ,
min_child_weight =3 ,
n_estimators=50)
xgb_best.fit(X,Y)
xgb_best.n_estimators = 50
scores = []
numest = []
for i in range(20):
    xgb_best.fit(X, Y)
    probs = [y for (x, y) in xgb_best.predict_proba(X)]
    probs_test = [y for (x, y) in xgb_best.predict_proba(X_test)]
    print(xgb_best.n_estimators, roc_auc_score(Y, probs), roc_auc_score(y_test, probs_test))
    print(max(probs_test))  # maximum of the probability distribution
    scores.append(roc_auc_score(y_test, probs_test))
    numest.append(xgb_best.n_estimators)
    xgb_best.n_estimators += 50
plt.figure(1, figsize=(10, 7))
plt.plot(numest,scores,color = "tab:blue", linestyle='--', marker='o')
plt.xlabel('Number of estimators')
plt.ylabel('AUC test score')
plt.title('XGboost Auc score vs number of estimators')
Final Model
xg_final = xgb.XGBClassifier(objective = "binary:logistic",
learning_rate=0.01 ,
subsample=0.8 ,
max_depth=6 ,
colsample_bytree=0.5 ,
min_child_weight =3 ,
n_estimators=450)
xg_final.fit(X,Y)
with open('xgb_final.pkl', 'wb') as f:
    pkl.dump(xg_final, f)
xgb_probs = [y for (x, y) in xg_final.predict_proba(X)]
xgb_probs_test = [y for (x, y) in xg_final.predict_proba(X_test)]
print(roc_auc_score(Y, xgb_probs))
print(roc_auc_score(y_test, xgb_probs_test))
xgb_fpr_train, xgb_tpr_train, _ = roc_curve(Y, xgb_probs)
xgb_fpr, xgb_tpr, _ = roc_curve(y_test, xgb_probs_test)
plt.figure(1, figsize=(10, 7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(xgb_fpr_train, xgb_tpr_train, label='Xgboost - train', color = "tab:blue")
plt.plot(xgb_fpr, xgb_tpr, label='Xgboost - test', color = "darkseagreen")
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Xgboost - ROC curve')
plt.legend(loc='best')
plt.show()
with open('xgb_final.pkl', 'rb') as f:
    xgb = pkl.load(f)  # note: this rebinds the name xgb, shadowing the xgboost module
xgb_preds = xgb.predict(X_test.drop(drop_c, axis=1))
xgb_probs_train = xgb.predict_proba(X.drop(drop_c, axis=1))[:, 1]
xgb_probs = xgb.predict_proba(X_test.drop(drop_c, axis=1))[:, 1]
xgb_auc_train = roc_auc_score(Y, xgb_probs_train)
xgb_auc = roc_auc_score(y_test, xgb_probs)
print(precision_score(y_test, xgb_preds))
print(recall_score(y_test, xgb_preds))
print(accuracy_score(y_test, xgb_preds))
print(f1_score(y_test, xgb_preds))
precision, recall, th = precision_recall_curve(y_test, xgb_probs)
plt.figure(1, figsize=(10, 7))
plt.plot(th, precision[1:], label="Precision",linewidth=3)
plt.plot(th, recall[1:], label="Recall",linewidth=3)
plt.title('Precision and recall for different thresholds')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.xlim((0,1))
plt.axvline(x=0.425, color = 'black', ls = 'dashed')
plt.axvline(x=0.5, color = 'deeppink', ls = 'dashed')
plt.legend()
plt.show()
def pretty_confusion_matrix(y, preds):
    return pd.DataFrame(confusion_matrix(y, preds),
                        columns=pd.MultiIndex(levels=[['prediction'], ['0', '1']], codes=[[0, 0], [0, 1]]),
                        index=pd.MultiIndex(levels=[['target'], ['0', '1']], codes=[[0, 0], [0, 1]]))
pretty_confusion_matrix(y_test, xgb_preds)
the_cutoff = 0.425
new_xgb_preds = [1 if x >= the_cutoff else 0 for x in xgb_probs]
pretty_confusion_matrix(y_test, new_xgb_preds)
print(precision_score(y_test, new_xgb_preds))
print(recall_score(y_test, new_xgb_preds))
print(accuracy_score(y_test, new_xgb_preds))
print(f1_score(y_test, new_xgb_preds))
from yellowbrick.classifier import DiscriminationThreshold
visualizer = DiscriminationThreshold(xgb)
visualizer.fit(X, Y) # Fit the data to the visualizer
visualizer.show()
the_cutoff = 0.36
new_xgb_preds = [1 if x >= the_cutoff else 0 for x in xgb_probs]
pretty_confusion_matrix(y_test, new_xgb_preds)
print(precision_score(y_test, new_xgb_preds))
print(recall_score(y_test, new_xgb_preds))
print(accuracy_score(y_test, new_xgb_preds))
print(f1_score(y_test, new_xgb_preds))
from xgboost import plot_importance
import xgboost as xgb  # rebind xgb to the module: the name was shadowed by the loaded model above
plot_importance(xg_final)
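plot_importance ranks features by how often they are used in the fitted booster; a related, sortable view can be built from the sklearn-style feature_importances_ attribute. A minimal self-contained sketch on a toy decision tree (the data and column names are made up):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: the 'signal' column fully determines the label, so it should dominate
X_toy = pd.DataFrame({'noise': [0, 1, 0, 1, 0, 1],
                      'signal': [0, 0, 0, 1, 1, 1]})
y_toy = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
# Importances sorted from most to least important
importances = (pd.Series(tree.feature_importances_, index=X_toy.columns)
                 .sort_values(ascending=False))
print(importances)
```

Note that plot_importance's default "weight" ranking (split counts) and feature_importances_ need not order features identically.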
tuneX = X.copy()
tuneX_test = X_test.copy()
tuneX.drop(['DeviceProtection_Yes', 'StreamingTV_Yes',"OnlineBackup_Yes","StreamingMovies_Yes","TechSupport_Yes","OnlineSecurity_Yes"], axis=1, inplace=True)
tuneX_test.drop(['DeviceProtection_Yes', 'StreamingTV_Yes',"OnlineBackup_Yes","StreamingMovies_Yes","TechSupport_Yes","OnlineSecurity_Yes"], axis=1, inplace=True)
xgb_best = xgb.XGBClassifier(objective="binary:logistic",
learning_rate=0.01,
subsample=0.7,
max_depth=6,
colsample_bytree=0.7,
min_child_weight=3,
n_estimators=50)
xgb_best.n_estimators = 50  # starting size; the loop below refits at each step
scores = []
numest = []
for i in range(20):
xgb_best.fit(tuneX, Y)
probs = [y for (x, y) in xgb_best.predict_proba(tuneX)]
probs_test = [y for (x, y) in xgb_best.predict_proba(tuneX_test)]
print(xgb_best.n_estimators, roc_auc_score(Y, probs), roc_auc_score(y_test, probs_test))
print(max(probs_test)) # maximum of the probability distribution.
scores.append(roc_auc_score(y_test, probs_test))
numest.append(xgb_best.n_estimators)
xgb_best.n_estimators += 50
plt.figure(1, figsize=(10, 7))
plt.plot(numest,scores,color = "tab:blue", linestyle='--', marker='o')
plt.xlabel('Number of estimators')
plt.ylabel('AUC test score')
plt.title('XGBoost AUC score vs number of estimators')
plt.show()
Dropping the least important variables makes no difference in performance, so we decided to keep them all: in this case we have no memory constraint and the fitting time is comparable.
r_seed = 1
dsvm = svm.SVC(random_state=r_seed)
C_range = np.logspace(1,4 , 7)
params = {'C': C_range,
'kernel': [ 'linear','sigmoid','rbf'],
'tol' : [0.005,0.01,0.002,0.02]}
cv_tuning = RandomizedSearchCV(dsvm, params, random_state=r_seed, scoring='roc_auc',
return_train_score=True, cv=5, n_iter=20, n_jobs=-1)
cv_tuning.fit(X, Y)
print_cv(cv_tuning)
r_seed = 1
dsvm = svm.SVC(random_state=r_seed,tol=0.005, kernel='linear', C= 17.78279410038923,probability=True)
dsvm.fit(X, Y)
dsvm_probs = [y for (x, y) in dsvm.predict_proba(X)]
dsvm_probs_test = [y for (x, y) in dsvm.predict_proba(X_test)]
print(roc_auc_score(Y, dsvm_probs))
print(roc_auc_score(y_test, dsvm_probs_test))
dsvm_fpr_train, dsvm_tpr_train, _ = roc_curve(Y, dsvm_probs)
dsvm_fpr, dsvm_tpr, _ = roc_curve(y_test, dsvm_probs_test)
plt.figure(1, figsize=(10, 7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(dsvm_fpr_train, dsvm_tpr_train, label='SVM - train')
plt.plot(dsvm_fpr, dsvm_tpr, label='SVM - test')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('SVM - ROC curve')
plt.legend(loc='best')
plt.show()
with open('./models/svm_final.pkl', 'wb') as f:
pkl.dump(dsvm, f)
with open('./models/svm_final.pkl', 'rb') as f:
svm = pkl.load(f)  # note: this rebinds svm from the sklearn module to the fitted model
svm_preds = svm.predict(X_test.drop(drop_c, axis=1))
svm_probs_train = svm.predict_proba(X.drop(drop_c, axis=1))[:, 1]
svm_probs = svm.predict_proba(X_test.drop(drop_c, axis=1))[:, 1]
svm_auc_train = roc_auc_score(Y, svm_probs_train)
svm_auc = roc_auc_score(y_test, svm_probs)
print(precision_score(y_test, svm_preds))
print(recall_score(y_test, svm_preds))
print(accuracy_score(y_test, svm_preds))
print(f1_score(y_test, svm_preds))
For the SVM, unlike the other models, we base the cut-off choice on the following graph alone, for computational-cost reasons.
precision, recall, th = precision_recall_curve(y_test, svm_probs)
plt.figure(1, figsize=(10, 7))
plt.plot(th, precision[1:], label="Precision",linewidth=3)
plt.plot(th, recall[1:], label="Recall",linewidth=3)
plt.title('Precision and recall for different thresholds')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.xlim((0,1))
plt.axvline(x=0.43, color = 'black', ls = 'dashed')
plt.axvline(x=0.5, color = 'deeppink', ls = 'dashed')
plt.legend()
plt.show()
pretty_confusion_matrix(y_test, svm_preds)
the_cutoff = 0.43
new_svm_preds = [1 if x >= the_cutoff else 0 for x in svm_probs]
pretty_confusion_matrix(y_test, new_svm_preds)
print(precision_score(y_test, new_svm_preds))
print(recall_score(y_test, new_svm_preds))
print(accuracy_score(y_test, new_svm_preds))
print(f1_score(y_test, new_svm_preds))
the_cutoff = 0.4
new_svm_preds = [1 if x >= the_cutoff else 0 for x in svm_probs]
pretty_confusion_matrix(y_test, new_svm_preds)
print(precision_score(y_test, new_svm_preds))
print(recall_score(y_test, new_svm_preds))
print(accuracy_score(y_test, new_svm_preds))
print(f1_score(y_test, new_svm_preds))
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_curve, roc_auc_score
from sklearn.ensemble import VotingClassifier  # combines the already fitted models
# into a single classifier that returns a new probability distribution
with open('./models/lr_model.pkl', 'rb') as f:
lr = pkl.load(f)
with open('./models/dt_model.pkl', 'rb') as f:
dt = pkl.load(f)
with open('./models/rf_model.pkl', 'rb') as f:
rf = pkl.load(f)
with open('gb_model.pkl', 'rb') as f:
gb = pkl.load(f)
with open('xgb_final.pkl', 'rb') as f:
xgb = pkl.load(f)
with open('./models/svm_final.pkl', 'rb') as f:
svm = pkl.load(f)
In this section we compare the models we have fitted, discussing their differences and similarities.
We display the results of several metrics for each model, always using the default threshold of 0.5.
lr_preds = lr.predict(X_test_lr.drop(drop_c, 1))
lr_probs_train = [y for (x, y) in lr.predict_proba(X_lr.drop(drop_c, 1))]
lr_probs = [y for (x, y) in lr.predict_proba(X_test_lr.drop(drop_c, 1))]
lr_auc_train = roc_auc_score(Y, lr_probs_train)
lr_auc = roc_auc_score(y_test, lr_probs)
dt_preds = dt.predict(X_test)
dt_probs_train = [y for (x, y) in dt.predict_proba(X)]
dt_probs = [y for (x, y) in dt.predict_proba(X_test)]
dt_auc_train = roc_auc_score(Y, dt_probs_train)
dt_auc = roc_auc_score(y_test, dt_probs)
rf_preds = rf.predict(X_test)
rf_probs_train = [y for (x, y) in rf.predict_proba(X)]
rf_probs = [y for (x, y) in rf.predict_proba(X_test)]
rf_auc_train = roc_auc_score(Y, rf_probs_train)
rf_auc = roc_auc_score(y_test, rf_probs)
gb_preds = gb.predict(X_test)
gb_probs_train = [y for (x, y) in gb.predict_proba(X)]
gb_probs = [y for (x, y) in gb.predict_proba(X_test)]
gb_auc_train = roc_auc_score(Y, gb_probs_train)
gb_auc = roc_auc_score(y_test, gb_probs)
xgb_preds = xgb.predict(X_test)
xgb_probs_train = [y for (x, y) in xgb.predict_proba(X)]
xgb_probs = [y for (x, y) in xgb.predict_proba(X_test)]
xgb_auc_train = roc_auc_score(Y, xgb_probs_train)
xgb_auc = roc_auc_score(y_test, xgb_probs)
svm_preds = svm.predict(X_test)
svm_probs_train = [y for (x, y) in svm.predict_proba(X)]
svm_probs = [y for (x, y) in svm.predict_proba(X_test)]
svm_auc_train = roc_auc_score(Y, svm_probs_train)
svm_auc = roc_auc_score(y_test, svm_probs)
plt.figure(figsize=(15, 8))
i = 230  # subplot grid code: 2 rows x 3 columns, incremented to 231..236 below
for model_name, probs in zip(['LogReg', 'DecTree', 'RandomForest', 'GBoost', 'XGBoost', 'SVM'],
[lr_probs, dt_probs, rf_probs, gb_probs, xgb_probs, svm_probs]):
i += 1
plt.subplot(i)
plt.hist(probs)
plt.title(model_name)
plt.show()
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
dt_fpr, dt_tpr, _ = roc_curve(y_test, dt_probs)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
gb_fpr, gb_tpr, _ = roc_curve(y_test, gb_probs)
xgb_fpr, xgb_tpr, _ = roc_curve(y_test, xgb_probs)
svm_fpr, svm_tpr, _ = roc_curve(y_test, svm_probs)
We now build a dataframe that displays both train and test AUC for each fitted model:
pd.DataFrame({'train_ROC': [lr_auc_train, dt_auc_train, rf_auc_train, gb_auc_train, xgb_auc_train, svm_auc_train],
'test_ROC': [lr_auc, dt_auc, rf_auc, gb_auc, xgb_auc, svm_auc]},
index=['Logistic Regression', 'Decision Tree', 'Random Forest', 'Gradient Boosting', 'XGBoosting', 'Support Vector Machine'])
As the table shows, on the train set all models reach an AUC of approximately 0.85, except the SVM (slightly lower, at 0.80) and the XGBoost (higher, at 0.89). The test AUCs follow a similar pattern, averaging around 0.84, again with the SVM lower at 0.80. On these figures alone it is difficult to discriminate between the models and prefer one over another; to select the best model(s), we will study and compare other metrics below.
There are no clear signs of overfitting, apart from a slight train-test gap for the XGBoost model.
plt.figure(1, figsize=(10, 7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(lr_fpr, lr_tpr, label='LogReg')
plt.plot(dt_fpr, dt_tpr, label='DecTree')
plt.plot(rf_fpr, rf_tpr, label='RandomForest')
plt.plot(gb_fpr, gb_tpr, label='GBoost')
plt.plot(xgb_fpr, xgb_tpr, label='XGBoost')
plt.plot(svm_fpr, svm_tpr, label='SVM')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve - test set')
plt.legend(loc='best')
plt.show()
As a starting point for measuring the performance of our classifiers, we display the confusion matrices for all the models. Our primary goal is to reduce the number of false negatives (FN), i.e. the observations we classified as 0 (no churn) that actually turned out to be 1 (churn). However, we also want to account for precision, so the F1 score is the ultimate metric we consider.
To reduce the FN count, we computed several confusion matrices for each model by varying the cut-off; here we present the ones obtained with the standard cut-off of 0.5.
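The precision, recall and F1 values reported below can be read off a confusion matrix directly; a minimal sketch on made-up targets and predictions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy targets and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

# Binary confusion matrix flattens to (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                          # how many predicted churners really churn
recall = tp / (tp + fn)                             # how many real churners we catch
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)
```

The hand-derived values match the sklearn scorers used throughout this section.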
pretty_confusion_matrix(y_test, lr_preds)
pretty_confusion_matrix(y_test, dt_preds)
pretty_confusion_matrix(y_test, rf_preds)
pretty_confusion_matrix(y_test, gb_preds)
pretty_confusion_matrix(y_test, xgb_preds)
pretty_confusion_matrix(y_test, svm_preds)
We also analyzed the accuracy of each model, i.e. the proportion of correctly classified observations:
accuracies = []
for name, preds in [('lr', lr_preds), ('dt', dt_preds), ('rf', rf_preds), ('gb', gb_preds), ('xgb', xgb_preds), ('svm', svm_preds)]:
a = accuracy_score(y_test, preds)
accuracies.append(a)
print('accuracy for {}: {}'.format(name, a))
Here it is hard to prefer one model over the others, because their accuracies are very similar, around 0.79 for all of them.
To try to discriminate between the fitted models, we now compute precision and recall:
precisions = []
recalls = []
for name, preds in [('lr', lr_preds), ('dt', dt_preds), ('rf', rf_preds), ('gb', gb_preds), ('xgb', xgb_preds), ('svm', svm_preds)]:
p = precision_score(y_test, preds)
r = recall_score(y_test, preds)
precisions.append(p)
recalls.append(r)
print('precision for {}: {}'.format(name, p))
print('recall for {}: {}'.format(name, r))
print('\n')
Keeping the default cut-off (0.5) for each model, we can discriminate among them according to recall. To give a more complete evaluation of performance, we also decided to compute the f1 score.
We now display the f1 score for each model:
# the f1 score is the harmonic mean of the precision and the recall of the model
f1 = []
for name, preds in [('lr', lr_preds), ('dt', dt_preds), ('rf', rf_preds), ('gb', gb_preds), ('xgb', xgb_preds), ('svm', svm_preds)]:
f = f1_score(y_test, preds)
f1.append(f)
print('f1-score for {}: {}'.format(name, f))
As can be seen, with the standard cut-off the Random Forest is the model with the lowest f1 score.
As the final chapter of our project, we build an ensemble of the models we have trained, trying to improve the results by combining the predictive capabilities of all the models we have tuned.
In addition, we will provide a summary table with results and metrics for the different models, using as cut-off the threshold suggested by the DiscriminationThreshold function from yellowbrick.classifier.
For the ensemble we opted for a soft voting classifier (which uses probabilities rather than labels: it predicts the class whose summed predicted probabilities across models is largest), since we have tuned and calibrated each single model. Moreover, because of its training time and weaker performance relative to the other models, we decided not to include the SVM in the ensemble.
Indeed, we only integrated into the ensemble the models with an f1 score higher than 0.6.
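The soft-vote rule can be illustrated numerically; a toy sketch with made-up class probabilities from three hypothetical base models, also showing how soft voting can disagree with majority (hard) voting:

```python
import numpy as np

# Made-up [P(class 0), P(class 1)] rows from three hypothetical base models
model_probs = np.array([[0.6, 0.4],
                        [0.6, 0.4],
                        [0.1, 0.9]])

avg = model_probs.mean(axis=0)     # soft vote: average the probability distributions
soft_label = int(np.argmax(avg))   # argmax of the averaged probabilities
hard_label = int(np.bincount(model_probs.argmax(axis=1)).argmax())  # majority of labels
print(avg, soft_label, hard_label)
```

Here the hard vote picks class 0 (two models out of three), while the soft vote picks class 1, because the third model is far more confident than the other two.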
ensemble = VotingClassifier(estimators=[('lr',lr), ('dt', dt), ('rf', rf), ('gb', gb),('xgb',xgb)], voting='soft')
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
# training:
ensemble.fit(X,Y)
# test on test set:
ensemble_preds = ensemble.predict(X_test)
ensemble_probs_train = [y for (x, y) in ensemble.predict_proba(X)]
ensemble_probs = [y for (x, y) in ensemble.predict_proba(X_test)]
ensemble_auc_train = roc_auc_score(Y, ensemble_probs_train)
ensemble_auc = roc_auc_score(y_test, ensemble_probs)
ensemble_f1_train = f1_score(Y, ensemble.predict(X))
ensemble_f1 = f1_score(y_test, ensemble_preds)
ensemble_precision_train = precision_score(Y, ensemble.predict(X))
ensemble_precision = precision_score(y_test, ensemble_preds)
ensemble_recall_train = recall_score(Y, ensemble.predict(X))
ensemble_recall = recall_score(y_test, ensemble_preds)
print('train auc: {}'.format(ensemble_auc_train))
print('test auc: {}'.format(ensemble_auc))
print('train f1: {}'.format(ensemble_f1_train))
print('test f1: {}'.format(ensemble_f1))
print('train recall: {}'.format(ensemble_recall_train))
print('test recall: {}'.format(ensemble_recall))
print('train precision: {}'.format(ensemble_precision_train))
print('test precision: {}'.format(ensemble_precision))
ensemble_fpr_train, ensemble_tpr_train, _ = roc_curve(Y, ensemble_probs_train)
ensemble_fpr, ensemble_tpr, _ = roc_curve(y_test, ensemble_probs)
plt.figure(1, figsize=(10, 7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(ensemble_fpr_train, ensemble_tpr_train, label='Ensemble - train')
plt.plot(ensemble_fpr, ensemble_tpr, label='Ensemble - test')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Ensemble - ROC curve')
plt.legend(loc='best')
plt.show()
visualizer = DiscriminationThreshold(ensemble)
visualizer.fit(X, Y) # Fit the data to the visualizer
visualizer.show()
the_cutoff = 0.29
new_ens_preds = [1 if x >= the_cutoff else 0 for x in ensemble_probs]
pretty_confusion_matrix(y_test, new_ens_preds)
print(precision_score(y_test, new_ens_preds))
print(recall_score(y_test, new_ens_preds))
print(accuracy_score(y_test, new_ens_preds))
print(f1_score(y_test, new_ens_preds))
final_results = {}
for model, m in [('lr', lr), ('dt', dt), ('rf', rf), ('gb', gb), ('xgb', xgb), ('ensemble', ensemble)]:
if model == 'lr':
preds = m.predict(X_lr.drop(drop_c, axis=1))
pred_probs = [x[1] for x in m.predict_proba(X_lr.drop(drop_c, axis=1))]
preds_test = m.predict(X_test_lr.drop(drop_c, axis=1))
pred_probs_test = [x[1] for x in m.predict_proba(X_test_lr.drop(drop_c, axis=1))]
else:
preds = m.predict(X)
pred_probs = [x[1] for x in m.predict_proba(X)]
preds_test = m.predict(X_test)
pred_probs_test = [x[1] for x in m.predict_proba(X_test)]
final_results[model] = {}
# auc test
final_results[model].update({'auc_test' : roc_auc_score(y_test, pred_probs_test)})
# auc train
final_results[model].update({'auc_train' : roc_auc_score(Y, pred_probs)})
# f1 test
final_results[model].update({'f1_test' : f1_score(y_test, preds_test)})
# f1 train
final_results[model].update({'f1_train' : f1_score(Y, preds)})
# precision test
final_results[model].update({'precision_test' : precision_score(y_test, preds_test)})
# precision train
final_results[model].update({'precision_train' : precision_score(Y, preds)})
# recall test
final_results[model].update({'recall_test' : recall_score(y_test, preds_test)})
# recall train
final_results[model].update({'recall_train' : recall_score(Y, preds)})
The following table collects the metrics for the models, with the default cut-off of 0.5:
pd.DataFrame(final_results).T
The following code displays the summary table of metrics for all the models and the ensemble, using the cut-off suggested by DiscriminationThreshold.
For our purposes, we prefer the threshold that maximizes the F1 score for each model, instead of the default (0.5) or the proportion of the target variable Churn (0.26).
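The F1-maximizing threshold for a model can be recovered directly from its precision-recall curve; a minimal sketch on made-up labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

# Toy labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
probs = np.array([0.1, 0.3, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55, 0.45, 0.05])

precision, recall, th = precision_recall_curve(y_true, probs)
# F1 at each candidate threshold (the final precision/recall point has no threshold)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1])
best_t = th[np.argmax(f1)]
preds = (probs >= best_t).astype(int)  # scores >= threshold are predicted positive
print(best_t, f1_score(y_true, preds))
```

Thresholding at best_t reproduces exactly the maximum F1 found on the curve.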
final_results = {}
for model, m in [('lr', lr), ('dt', dt), ('rf', rf), ('gb', gb), ('xgb', xgb), ('svm', svm), ('ensemble', ensemble)]:
if model == 'lr':
pred_probs = [x[1] for x in m.predict_proba(X_lr.drop(drop_c, 1))]
preds = [1 if x >= 0.34 else 0 for x in pred_probs]
pred_probs_test = [x[1] for x in m.predict_proba(X_test_lr.drop(drop_c, 1))]
preds_test = [1 if x >= 0.34 else 0 for x in pred_probs_test]
elif model == "dt":
pred_probs = [x[1] for x in m.predict_proba(X.drop(drop_c, 1))]
preds = [1 if x >= 0.42 else 0 for x in pred_probs]
pred_probs_test = [x[1] for x in m.predict_proba(X_test.drop(drop_c, 1))]
preds_test = [1 if x >= 0.42 else 0 for x in pred_probs_test]
elif model == "rf":
pred_probs = [x[1] for x in m.predict_proba(X.drop(drop_c, 1))]
preds = [1 if x >= 0.33 else 0 for x in pred_probs]
pred_probs_test = [x[1] for x in m.predict_proba(X_test.drop(drop_c, 1))]
preds_test = [1 if x >= 0.33 else 0 for x in pred_probs_test]
elif model == "gb":
pred_probs = [x[1] for x in m.predict_proba(X.drop(drop_c, 1))]
preds = [1 if x >= 0.35 else 0 for x in pred_probs]
pred_probs_test = [x[1] for x in m.predict_proba(X_test.drop(drop_c, 1))]
preds_test = [1 if x >= 0.35 else 0 for x in pred_probs_test]
elif model == "xgb":
pred_probs = [x[1] for x in m.predict_proba(X.drop(drop_c, 1))]
preds = [1 if x >= 0.36 else 0 for x in pred_probs]
pred_probs_test = [x[1] for x in m.predict_proba(X_test.drop(drop_c, 1))]
preds_test = [1 if x >= 0.36 else 0 for x in pred_probs_test]
elif model == "svm":
pred_probs = [x[1] for x in m.predict_proba(X.drop(drop_c, 1))]
preds = [1 if x >= 0.43 else 0 for x in pred_probs]
pred_probs_test = [x[1] for x in m.predict_proba(X_test.drop(drop_c, 1))]
preds_test = [1 if x >= 0.43 else 0 for x in pred_probs_test]
elif model == "ensemble":
pred_probs = [x[1] for x in m.predict_proba(X.drop(drop_c, 1))]
preds = [1 if x >= 0.29 else 0 for x in pred_probs]
pred_probs_test = [x[1] for x in m.predict_proba(X_test.drop(drop_c, 1))]
preds_test = [1 if x >= 0.29 else 0 for x in pred_probs_test]
final_results[model] = {}
# auc test
final_results[model].update({'auc_test' : roc_auc_score(y_test, pred_probs_test)})
# auc train
final_results[model].update({'auc_train' : roc_auc_score(Y, pred_probs)})
# f1 test
final_results[model].update({'f1_test' : f1_score(y_test, preds_test)})
# f1 train
final_results[model].update({'f1_train' : f1_score(Y, preds)})
# precision test
final_results[model].update({'precision_test' : precision_score(y_test, preds_test)})
# precision train
final_results[model].update({'precision_train' : precision_score(Y, preds)})
# recall test
final_results[model].update({'recall_test' : recall_score(y_test, preds_test)})
# recall train
final_results[model].update({'recall_train' : recall_score(Y, preds)})
# accuracy test
final_results[model].update({'accuracy_test' : accuracy_score(y_test, preds_test)})
# accuracy train
final_results[model].update({'accuracy_train' : accuracy_score(Y, preds)})
pd.DataFrame(final_results).T
By looking at the chosen metrics, such as test F1 and test AUC, we conclude that Gradient Boosting is the best model for predicting the churn of a current customer. The ensemble reaches an F1 score close to the Gradient Boosting model, but since the ensemble is built on top of the GB model, we prefer to keep the simpler one.
The two boosting models (Gradient Boosting and XGBoost) turned out to be the best-performing models we built. Although boosting is an effective prediction technique, its downside is a lack of interpretability.
Two libraries of so-called "explainers" come to our aid: LIME and SHAP.
LIME is a technique for explaining the predictions of any machine learning model. This is clearly useful for complex non-parametric models and black boxes, such as gradient boosting models or neural networks.
An explanation is created by locally approximating the underlying model with an interpretable one, e.g. a strongly regularised linear model or a decision tree. The interpretable model is trained on small perturbations of the original instance and only needs to provide a good local approximation.
The output of LIME is a list of features reflecting the contribution of each one to the prediction for a data sample. This provides local interpretability and also lets us determine which feature changes will have the most impact on the prediction. The explainer thus considers one observation at a time and shows not only the classification output but also what contributed the most, and how.
Since we use this instance as an example we show all the variables, but normally it would be better to select only the most important ones for a clearer view.
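The perturbation-and-local-fit recipe behind LIME can be sketched without the library; a toy illustration in which black_box stands in for any opaque model (the function, instance, and kernel width are all made up):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

def black_box(X):
    # Hypothetical opaque model: a nonlinear function of two features
    return 1 / (1 + np.exp(-(2 * X[:, 0] - X[:, 1] ** 3)))

x0 = np.array([0.5, 0.2])                   # instance to explain
# 1. Perturb the instance with small Gaussian noise
Z = x0 + rng.normal(scale=0.1, size=(500, 2))
# 2. Weight perturbations by proximity to x0 (Gaussian kernel)
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.05)
# 3. Fit a weighted linear surrogate as the local explanation
surrogate = Ridge(alpha=1.0).fit(Z, black_box(Z), sample_weight=w)
print(surrogate.coef_)                      # local feature contributions
```

The surrogate's coefficients recover the local behaviour of black_box around x0: the first feature pushes the prediction up, the second pushes it down.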
explainer = lime.lime_tabular.LimeTabularExplainer(X.values,
feature_names=X.columns.values.tolist(),
class_names=Y.unique())
exp = explainer.explain_instance(X_test.iloc[0],
gb_best.predict_proba,
num_features=24)
exp.show_in_notebook(show_all=False)
SHAP (SHapley Additive exPlanations), by Lundberg and Lee (2017), is a method to explain individual predictions. The idea is to use game theory to interpret the target model: SHAP computes Shapley values from coalitional game theory, where the feature values of a data instance act as players in a coalition. Shapley values tell us how to fairly distribute the "payout" (the prediction) among the features; for tabular data, a player is an individual feature value.
The technical definition of a Shapley value is the "average marginal contribution of a feature value over all possible coalitions". In other words, Shapley values consider all possible predictions for an instance using all possible combinations of inputs. Thanks to this exhaustive approach, SHAP can guarantee properties such as consistency and local accuracy.
Shap specifies the explanation as:
$$ g(z')=\phi_0+\sum_{j=1}^M\phi_jz_j' $$
where g is the explanation model, z' is the coalition vector, M is the maximum coalition size, and phi_j is the feature attribution (the Shapley value) for feature j.
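The "average marginal contribution over all coalitions" definition can be checked exactly on a tiny coalitional game; a sketch where value is a made-up payoff function over three hypothetical "feature" players:

```python
from itertools import permutations

players = ['tenure', 'contract', 'charges']   # hypothetical feature-players

def value(coalition):
    # Made-up payoff of each coalition (think: model output with those features)
    v = {frozenset(): 0, frozenset({'tenure'}): 10,
         frozenset({'contract'}): 20, frozenset({'charges'}): 5,
         frozenset({'tenure', 'contract'}): 40,
         frozenset({'tenure', 'charges'}): 20,
         frozenset({'contract', 'charges'}): 30,
         frozenset(players): 60}
    return v[frozenset(coalition)]

# Shapley value: average marginal contribution over all player orderings
shapley = {p: 0.0 for p in players}
orders = list(permutations(players))
for order in orders:
    seen = set()
    for p in order:
        shapley[p] += (value(seen | {p}) - value(seen)) / len(orders)
        seen.add(p)
print(shapley)
print(sum(shapley.values()))  # efficiency: attributions sum to value(players)
```

The attributions always sum to the full coalition's payoff, which is the "efficiency" property SHAP relies on for local accuracy.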
shap.initjs()
explainer = shap.TreeExplainer(gb_best)
shap_values = explainer.shap_values(X)
To get an overview of which features are most important for a model we can plot the SHAP values of every feature for every sample. The plot below sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of the impacts each feature has on the model output.
shap.summary_plot(shap_values, X)
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
The above explanation shows how each features contributes to the output value. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue.
If we take many explanations such as the one shown above, rotate them 90 degrees, and then stack them horizontally, we can see explanations for an entire dataset (plot is interactive):
shap.force_plot(explainer.expected_value, shap_values, X)